Smooth Contextual Bandits: Bridging the Parametric and Nondifferentiable Regret Regimes

Authors

Hu, Kallus, and Mao

Abstract

Dynamic Personalized Decision Making Beyond the Super-Extrapolatable and Super-Local Cases

Contextual bandit problems model the inherent trade-off between exploration and exploitation in personalized decision making in marketing, healthcare, revenue management, and more. Specifically, the difficulty of such a problem is characterized by the optimal growth rate of its regret. Intuitively, this rate should depend on how complex the underlying supervised learning problem is, namely, how much observing the reward in one context can tell us about the mean rewards in another. To formalize this intuitive relationship, Hu, Kallus, and Mao study, in “Smooth Contextual Bandits: Bridging the Parametric and Nondifferentiable Regret Regimes,” a nonparametric contextual bandit problem in which the mean reward functions are β-times differentiable (more generally, Hölder β-smooth). This interpolates between two extremes previously studied in isolation: nondifferentiable bandits (β ≤ 1), for which running separated noncontextual bandits in different context regions achieves rate-optimal regret, and parametric-response bandits (β = ∞), for which rate-optimal regret can be achieved with minimal or no exploration because of infinite extrapolatability across contexts. The authors develop an algorithm that operates neither fully locally nor fully globally, revealing the nature of the in-between smooth setting and shedding light on the crucial interplay of functional complexity and dynamic decision making.
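The abstract contrasts the fully local strategy used in the nondifferentiable regime (β ≤ 1) with fully global parametric extrapolation. As a hedged illustration of the local approach only (not the authors' algorithm), the sketch below partitions the context space into bins and runs an independent UCB instance in each bin; the two Lipschitz mean-reward functions are invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (invented) Lipschitz mean-reward functions for K = 2 arms
# over contexts x in [0, 1]; they cross at x = 1/3, so the best arm is
# context-dependent.
def reward_means(x):
    return np.array([0.5 + 0.3 * x, 0.7 - 0.3 * x])

T = 5000
n_bins = 10        # for beta = 1, bin width of order T^(-1/3) is rate-optimal
K = 2
counts = np.zeros((n_bins, K))
sums = np.zeros((n_bins, K))
regret = 0.0

for t in range(1, T + 1):
    x = rng.random()
    b = min(int(x * n_bins), n_bins - 1)   # bin index of this context
    mus = reward_means(x)
    if counts[b].min() == 0:
        # pull each arm once in a new bin before using UCB indices
        a = int(counts[b].argmin())
    else:
        ucb = sums[b] / counts[b] + np.sqrt(2 * np.log(t) / counts[b])
        a = int(ucb.argmax())
    counts[b, a] += 1
    sums[b, a] += mus[a] + 0.1 * rng.standard_normal()  # noisy reward
    regret += mus.max() - mus[a]

print(f"cumulative regret after {T} rounds: {regret:.1f}")
```

Binning treats each region as a separate noncontextual bandit and forgoes any sharing of information across bins; the paper's point is that for 1 < β < ∞, neither this purely local scheme nor purely global extrapolation is rate-optimal.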


Similar Articles

Improved Regret Bounds for Oracle-Based Adversarial Contextual Bandits

We give an oracle-based algorithm for the adversarial contextual bandit problem, where either contexts are drawn i.i.d. or the sequence of contexts is known a priori, but where the losses are picked adversarially. Our algorithm is computationally efficient, assuming access to an offline optimization oracle, and enjoys a regret of order O((KT)^(2/3) (log N)^(1/3)), where K is the number of actions,...


Algorithms with Logarithmic or Sublinear Regret for Constrained Contextual Bandits

We study contextual bandits with budget and time constraints under discrete contexts, referred to as constrained contextual bandits. The budget and time constraints significantly increase the complexity of exploration-exploitation tradeoff because they introduce coupling among contexts. Such coupling effects make it difficult to obtain oracle solutions that assume known statistics of bandits. T...


Open Problem: First-Order Regret Bounds for Contextual Bandits

We describe two open problems related to first-order regret bounds for contextual bandits. The first asks for an algorithm with a regret bound of Õ(√(L⋆ K ln N)), where there are K actions, N policies, and L⋆ is the cumulative loss of the best policy. The second asks for an optimization-oracle-efficient algorithm with regret Õ(L⋆ · poly(K, ln(N/δ))). We describe some positive results, such as an ineff...


Regret of Queueing Bandits

We consider a variant of the multiarmed bandit problem where jobs queue for service, and service rates of different servers may be unknown. We study algorithms that minimize queue-regret: the (expected) difference between the queue-lengths obtained by the algorithm, and those obtained by a “genie”-aided matching algorithm that knows exact service rates. A naive view of this problem would sugges...


Dueling Bandits with Weak Regret

We consider online content recommendation with implicit feedback through pairwise comparisons, formalized as the so-called dueling bandit problem. We study the dueling bandit problem in the Condorcet winner setting, and consider two notions of regret: the more well-studied strong regret, which is 0 only when both arms pulled are the Condorcet winner; and the less well-studied weak regret, which...



Journal

Journal: Operations Research

Year: 2022

ISSN: 1526-5463, 0030-364X

DOI: https://doi.org/10.1287/opre.2021.2237